Moving is something almost everyone experiences at least once. Whether for a job, family, or another reason, it is important to understand the current state of the housing market. As college students who will soon be looking for jobs, we need to understand the housing market as we think about relocation. This topic matters not only to us but to everyone considering a move, especially given the current tightness of the housing market.
According to the Federal Reserve, the housing market has tightened considerably: the supply of homes available for sale has fallen to historically low levels, and home price growth accelerated sharply during the pandemic. In this tutorial, we aim to analyze how home values have changed in the 21st century across the United States. We will measure these changes with the Zillow Home Value Index (ZHVI), a seasonally adjusted measure of typical home value and market changes.
ZHVI is a measure provided by Zillow Inc. that measures two key variables in the current housing market as well as over time. These two variables are home value and housing market appreciation. For more details about ZHVI and how it is calculated, visit this link. Another link that may be helpful is the ZHVI User Guide.
Using the ZHVI dataset, our goal is to predict future ZHVI and to see whether the size of a region affects its ZHVI. To do this, we will work through the data science lifecycle, which has five parts:
This data was obtained from https://www.zillow.com/research/data/ under Home Values.
First, let's import some libraries that we will be using:
If you do not have any of these libraries installed, you can install them by entering pip3 install [package]. If you need any more information, you can see the documentation or a tutorial for each library listed below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
import statsmodels.formula.api
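Before loading any data, it can be worth confirming the environment is set up. The sketch below (package names assumed from the imports above) checks that each library is importable; note that although the module is imported as `sklearn`, the PyPI package is named `scikit-learn`.

```python
# Sanity-check the environment before loading any data.
# Module names are assumed from the imports in this tutorial;
# note that the "sklearn" module is installed via "pip3 install scikit-learn".
import importlib

def check_packages(packages):
    """Return a dict mapping each module name to True if it imports cleanly."""
    status = {}
    for pkg in packages:
        try:
            importlib.import_module(pkg)
            status[pkg] = True
        except ImportError:
            status[pkg] = False
    return status

for pkg, ok in check_packages(
    ["numpy", "pandas", "matplotlib", "sklearn", "seaborn", "statsmodels"]
).items():
    print(pkg, "OK" if ok else "missing -- try: pip3 install " + pkg)
```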
raw_data = pd.read_csv("data.csv")
raw_data.head()
| RegionID | SizeRank | RegionName | RegionType | StateName | 2000-01-31 | 2000-02-29 | 2000-03-31 | 2000-04-30 | 2000-05-31 | ... | 2021-02-28 | 2021-03-31 | 2021-04-30 | 2021-05-31 | 2021-06-30 | 2021-07-31 | 2021-08-31 | 2021-09-30 | 2021-10-31 | 2021-11-30 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 102001 | 0 | United States | Country | NaN | 127104.0 | 127448.0 | 127809.0 | 128546.0 | 129288.0 | ... | 274766.0 | 278419.0 | 282276.0 | 286980.0 | 292503.0 | 298382.0 | 303772.0 | 308393.0 | 312486.0 | 316368.0 |
| 1 | 394913 | 1 | New York, NY | Msa | NY | 223875.0 | 225213.0 | 226416.0 | 228785.0 | 230991.0 | ... | 520343.0 | 524700.0 | 529431.0 | 534932.0 | 542083.0 | 549998.0 | 557578.0 | 563469.0 | 568010.0 | 571556.0 |
| 2 | 753899 | 2 | Los Angeles-Long Beach-Anaheim, CA | Msa | CA | 231151.0 | 231956.0 | 233189.0 | 235533.0 | 238046.0 | ... | 748563.0 | 756432.0 | 767475.0 | 783139.0 | 802944.0 | 823842.0 | 837285.0 | 846395.0 | 851153.0 | 858357.0 |
| 3 | 394463 | 3 | Chicago, IL | Msa | IL | 169017.0 | 169416.0 | 169932.0 | 170965.0 | 172060.0 | ... | 258870.0 | 260970.0 | 263993.0 | 266728.0 | 270524.0 | 274664.0 | 278755.0 | 281808.0 | 284452.0 | 287131.0 |
| 4 | 394514 | 4 | Dallas-Fort Worth, TX | Msa | TX | 130276.0 | 130380.0 | 130466.0 | 130678.0 | 130900.0 | ... | 280524.0 | 284541.0 | 290061.0 | 296469.0 | 303787.0 | 311586.0 | 319478.0 | 326661.0 | 332734.0 | 338194.0 |
5 rows × 268 columns
Now, let's clean the data so we can extract only the relevant details, such as dates, locations, and the ZHVI. By getting rid of unnecessary information, we get a much cleaner dataset that is easier to read.
# Melting the dates and converting dates to datetime
data = raw_data.melt(id_vars=["RegionName", "StateName", "RegionID", "SizeRank", "RegionType"],
var_name = "Date",
value_name ="Zillow Home Value Index")
data["Date"] = pd.to_datetime(data["Date"], format="%Y-%m-%d")
# Dropping unnecessary columns
data.drop(["RegionID", "RegionType"], axis = 1, inplace = True)
data.dropna(subset = ["StateName"], inplace = True)
data.head()
| RegionName | StateName | SizeRank | Date | Zillow Home Value Index | |
|---|---|---|---|---|---|
| 1 | New York, NY | NY | 1 | 2000-01-31 | 223875.0 |
| 2 | Los Angeles-Long Beach-Anaheim, CA | CA | 2 | 2000-01-31 | 231151.0 |
| 3 | Chicago, IL | IL | 3 | 2000-01-31 | 169017.0 |
| 4 | Dallas-Fort Worth, TX | TX | 4 | 2000-01-31 | 130276.0 |
| 5 | Philadelphia, PA | PA | 5 | 2000-01-31 | 129615.0 |
The data now looks clean, but there could still be missing values. Let's check! Using pandas, we can count the missing values in the DataFrame with the code below.
data["Zillow Home Value Index"].isna().sum()
48727
In order to impute the missing values, we are going to calculate the average ZHVI for the whole state on that date.
# in order to impute the n/a values, calculate the avg index for that state on that date
avg_state_date = data.groupby(["StateName", "Date"], as_index=False)["Zillow Home Value Index"].mean()
avg_state_date.rename(columns={"Zillow Home Value Index": "Average Zillow Home Value Index for State on Date"}, inplace=True)
# Merging back to original dataframe and filling the holes
data = pd.merge(data, avg_state_date, how = "left", on = ["StateName", "Date"])
data["Zillow Home Value Index"] = data["Zillow Home Value Index"].fillna(data["Average Zillow Home Value Index for State on Date"])
data["Zillow Home Value Index"].isna().sum()
3468
There are still some missing values. Let's fill those with the average ZHVI for the whole region across our window of time. Once these are filled in, we have 0 missing values and our dataset is almost ready to be explored.
avg_region = data.groupby("RegionName", as_index=False)["Zillow Home Value Index"].mean()
avg_region.rename(columns={"Zillow Home Value Index": "Average Zillow Home Value Index for Region across Time"}, inplace=True)
data = pd.merge(data, avg_region, how = "left", on = ["RegionName"])
data["Zillow Home Value Index"] = data["Zillow Home Value Index"].fillna(data["Average Zillow Home Value Index for Region across Time"])
data["Zillow Home Value Index"].isna().sum()
0
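As a side note, the same two-stage imputation can be written more compactly with `groupby(...).transform("mean")`, which returns a Series aligned to the original rows, so no merge is needed. A sketch on a small made-up frame (the values are invented, not real ZHVI data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned data (values are made up).
toy = pd.DataFrame({
    "StateName":  ["NY", "NY", "NY", "CA"],
    "RegionName": ["Albany, NY", "Buffalo, NY", "Albany, NY", "Los Angeles, CA"],
    "Date": pd.to_datetime(["2020-01-31", "2020-01-31", "2020-02-29", "2020-01-31"]),
    "Zillow Home Value Index": [300000.0, np.nan, np.nan, 700000.0],
})

# Stage 1: fill with the state's average on that date (aligned via transform).
state_date_avg = toy.groupby(["StateName", "Date"])["Zillow Home Value Index"].transform("mean")
toy["Zillow Home Value Index"] = toy["Zillow Home Value Index"].fillna(state_date_avg)

# Stage 2: fill anything still missing with the region's average across time.
region_avg = toy.groupby("RegionName")["Zillow Home Value Index"].transform("mean")
toy["Zillow Home Value Index"] = toy["Zillow Home Value Index"].fillna(region_avg)

print(toy["Zillow Home Value Index"].isna().sum())  # 0
```

Buffalo's gap is filled by the NY average on that date, while the Albany row for 2020-02-29 (where the whole state-date group is missing) falls through to the region average, mirroring the two passes above.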
Let's also calculate the average ZHVI for each state across time.
avg_state = data.groupby("StateName", as_index=False)["Zillow Home Value Index"].mean()
avg_state.rename(columns={"Zillow Home Value Index": "Average Zillow Home Value Index for State across Time"}, inplace=True)
data = pd.merge(data, avg_state, how = "left", on = ["StateName"])
data.head()
| RegionName | StateName | SizeRank | Date | Zillow Home Value Index | Average Zillow Home Value Index for State on Date | Average Zillow Home Value Index for Region across Time | Average Zillow Home Value Index for State across Time | |
|---|---|---|---|---|---|---|---|---|
| 0 | New York, NY | NY | 1 | 2000-01-31 | 223875.0 | 79925.461538 | 407514.752852 | 131142.310067 |
| 1 | Los Angeles-Long Beach-Anaheim, CA | CA | 2 | 2000-01-31 | 231151.0 | 221478.900000 | 505735.870722 | 358798.080588 |
| 2 | Chicago, IL | IL | 3 | 2000-01-31 | 169017.0 | 90254.285714 | 222440.041825 | 107777.913740 |
| 3 | Dallas-Fort Worth, TX | TX | 4 | 2000-01-31 | 130276.0 | 103721.555556 | 177825.041825 | 121059.292274 |
| 4 | Philadelphia, PA | PA | 5 | 2000-01-31 | 129615.0 | 92338.666667 | 216507.528517 | 133913.688633 |
Finally, the data is ready to be dissected. Let's move on to analysis.
To start off, we are going to visualize how the ZHVI changed per state over time. To do this, we will use the Matplotlib library to create a scatter plot containing every data point.
n = data.StateName.nunique()
distinct_colors = plt.cm.gist_rainbow(np.linspace(0, 1, n))
fig, ax = plt.subplots(figsize = (150, 100))
states = data.groupby(["StateName"])
for i, (name, group) in enumerate(states):
    ax.scatter(x="Date", y="Zillow Home Value Index", data=group, label=name, s=500, color=distinct_colors[i])
ax.legend(fontsize = 60)
plt.title("Zillow Home Value Index per State over Time", fontsize = 150)
plt.xlabel("Date", fontsize = 100)
plt.xticks(fontsize = 40)
plt.ylabel("ZHVI in millions", fontsize = 100)
plt.yticks(fontsize = 40);